Welcome to Norfolk Data Science

Anomaly Detection

  • 2017-09-05
  • Christopher Brossman

General Methods for Anomaly Detection

  • Model-based: some threshold for the deviation between expected and observed values is surpassed.
    • Requires labeled data and a fitted model of expected behavior.
  • Completely automated systems that look for deviations on their own.
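As a minimal sketch of the threshold idea (hypothetical example, base R only): compare each point to a rolling mean of its recent history and flag points more than k standard deviations away. The function name, window size, and k are all illustrative choices, not part of any library.

```r
# Hypothetical threshold-based detector: flag points more than k standard
# deviations away from the mean of the previous `window` observations.
detect_threshold_anomalies <- function(x, window = 24, k = 3) {
  flags <- logical(length(x))
  for (i in seq_along(x)) {
    if (i <= window) next                 # not enough history yet
    hist <- x[(i - window):(i - 1)]
    expected <- mean(hist)
    deviation <- abs(x[i] - expected)
    flags[i] <- deviation > k * sd(hist)  # threshold surpassed?
  }
  flags
}

set.seed(42)
x <- rnorm(200, mean = 1000, sd = 50)
x[150] <- 2000                            # inject an obvious anomaly
which(detect_threshold_anomalies(x))
```

This is the simplest version of the first bullet: the "model" is just a rolling mean, and the threshold is k rolling standard deviations.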

Time-series model using Twitter's AnomalyDetection


In [1]:
# First read in some data
library('tidyverse')
cr_hour <- read_csv('crh.csv')
cr_hour <- cr_hour %>% 
  mutate(ds = as.POSIXct(timestamp), cr_scaled = cr*100000) %>% # scale up so the expected values are easier to read later on
  filter(month == 8) %>% 
  select(ds,cr_scaled)
head(cr_hour)


Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
Parsed with column specification:
cols(
  Date = col_character(),
  year = col_integer(),
  month = col_integer(),
  day = col_integer(),
  hr = col_integer(),
  timestamp = col_datetime(format = ""),
  cr = col_double()
)
  ds                    cr_scaled
  2017-08-01 00:00:00        1000
  2017-08-01 01:00:00        1000
  2017-08-01 02:00:00        1000
  2017-08-01 03:00:00        2000
  2017-08-01 04:00:00        1000
  2017-08-01 05:00:00        1000

Let's look at the function

AnomalyDetectionTs(x, max_anoms = 0.1, direction = "pos", alpha = 0.05,
  only_last = NULL, threshold = "None", e_value = FALSE, longterm = FALSE,
  piecewise_median_period_weeks = 2, plot = FALSE, y_log = FALSE,
  xlabel = "", ylabel = "count", title = NULL, verbose = FALSE)

Arguments

  • x : Time series as a two column data frame where the first column consists of the timestamps and the second column consists of the observations.
  • max_anoms: Maximum number of anomalies that S-H-ESD will detect, as a proportion of the data (e.g. 0.02 = 2%).
  • direction: Directionality of the anomalies to be detected. Options are: 'pos' | 'neg' | 'both'.
  • alpha: The level of statistical significance with which to accept or reject anomalies.
  • only_last: Find and report anomalies only within the last day or hr in the time series. NULL | 'day' | 'hr'.
  • threshold: Only report positive going anoms above the threshold specified. Options are: 'None' | 'med_max' | 'p95' | 'p99'.
  • e_value: Add an additional column to the anoms output containing the expected value.
  • longterm: Increase anom detection efficacy for time series that are greater than a month. See Details below.
  • piecewise_median_period_weeks: The piecewise median time window as described in Vallis, Hochenbaum, and Kejariwal (2014). Defaults to 2.
  • plot: A flag indicating if a plot with both the time series and the estimated anoms, indicated by circles, should also be returned.
  • y_log: Apply log scaling to the y-axis. This helps with viewing plots that have extremely large positive anomalies relative to the rest of the data.
  • xlabel: X-axis label to be added to the output plot.
  • ylabel: Y-axis label to be added to the output plot.
  • title: Title for the output plot.
  • verbose: Enable debug messages.
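Under the hood, S-H-ESD combines a seasonal decomposition with the generalized ESD (extreme studentized deviate) test. Below is a stripped-down, hypothetical sketch of just the generalized ESD step in base R, to illustrate the idea — the real routine first removes the seasonal component and uses robust statistics (the median), per Vallis, Hochenbaum, and Kejariwal (2014), and the full test declares as outliers the largest i for which the statistic exceeds its critical value, whereas this sketch simply flags each exceeding step.

```r
# Sketch of the generalized ESD test (the "ESD" in S-H-ESD), base R only.
# Flags up to max_anoms outliers in x at significance level alpha.
esd_test <- function(x, max_anoms = 5, alpha = 0.05) {
  n <- length(x)
  idx <- seq_len(n)
  outliers <- integer(0)
  for (i in seq_len(max_anoms)) {
    # test statistic: largest absolute deviation from the mean, in sd units
    dev <- abs(x - mean(x))
    worst <- which.max(dev)
    R <- dev[worst] / sd(x)
    # critical value from the t distribution (Rosner, 1983)
    m <- n - i + 1                      # points remaining at this step
    p <- 1 - alpha / (2 * m)
    t <- qt(p, m - 2)
    lambda <- (m - 1) * t / sqrt((m - 2 + t^2) * m)
    if (R > lambda) outliers <- c(outliers, idx[worst])
    # remove the most extreme point and repeat
    x <- x[-worst]; idx <- idx[-worst]
  }
  sort(outliers)
}

set.seed(1)
y <- rnorm(100); y[c(20, 60)] <- c(6, -7)  # two injected outliers
esd_test(y)
```

Because the critical value is recomputed after each removal, ESD handles multiple outliers without the masking problem a single Grubbs test would have.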

In [2]:
library(AnomalyDetection)
ad = AnomalyDetectionTs(cr_hour, max_anoms=0.02, direction='both', 
                         plot=TRUE, e_value = TRUE,
                         xlabel = 'day', ylabel='Conversion Rate')

In [3]:
ad$anoms


            timestamp anoms expected_value
1 2017-08-01 10:00:00  4000           1823
2 2017-08-09 10:00:00  4000           1713
3 2017-08-28 04:00:00  3000            789

In [4]:
options(repr.plot.width=7, repr.plot.height=4)
ad$plot



In [5]:
# won't always have an anomaly
library(AnomalyDetection)
res = AnomalyDetectionTs(cr_hour, max_anoms=0.02, direction='both', plot=TRUE, e_value = TRUE,
                         xlabel = 'day', ylabel='Conversion Rate',only_last='day')
res


$anoms
data frame with 0 columns and 0 rows

$plot
NULL

In [6]:
# let's see if it can detect anomalies over a longer time period

cr_day <- read_csv('cr.csv')
cr_day <- cr_day %>% 
  mutate(d = as.POSIXct(d), cr = gsub('\t','',cr)) %>%
  mutate(cr_scaled = as.numeric(cr)*100000) %>% 
  select(d,cr_scaled)

library(AnomalyDetection)
res = AnomalyDetectionTs(cr_day, max_anoms=0.02, direction='both', plot=TRUE, e_value = TRUE,
                         xlabel = 'day', ylabel='Conversion Rate')
res


Parsed with column specification:
cols(
  d = col_date(format = ""),
  cr = col_character()
)
$anoms
            timestamp    anoms expected_value
1 2016-09-10 20:00:00 2332.764           1591
2 2016-11-24 19:00:00 4157.531           1420
3 2016-11-27 19:00:00 2671.162           1877

$plot

Let's check out a one-class SVM for multivariate anomaly detection


In [7]:
# let's look at the good old iris data set
library(e1071)
library(caret)
library(NLP)
library(tm)

data(iris)

# label versicolor as the positive class so we can split the data and score predictions
iris$SpeciesClass[iris$Species=="versicolor"] <- "TRUE"
iris$SpeciesClass[iris$Species!="versicolor"] <- "FALSE"
trainPositive<-subset(iris,SpeciesClass=="TRUE")
testnegative<-subset(iris,SpeciesClass=="FALSE")
inTrain<-createDataPartition(1:nrow(trainPositive),p=0.6,list=FALSE)

# keep only the positive-class predictors in a single data frame
trainpredictors<-trainPositive[inTrain,1:4]
trainLabels<-trainPositive[inTrain,6]

testPositive<-trainPositive[-inTrain,]
testPosNeg<-rbind(testPositive,testnegative)

testpredictors<-testPosNeg[,1:4]
testLabels<-testPosNeg[,6]

# see what this looks like
print(head(iris))
print('')
print(head(trainpredictors))


Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift


Attaching package: 'NLP'

The following object is masked from 'package:ggplot2':

    annotate

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species SpeciesClass
1          5.1         3.5          1.4         0.2  setosa        FALSE
2          4.9         3.0          1.4         0.2  setosa        FALSE
3          4.7         3.2          1.3         0.2  setosa        FALSE
4          4.6         3.1          1.5         0.2  setosa        FALSE
5          5.0         3.6          1.4         0.2  setosa        FALSE
6          5.4         3.9          1.7         0.4  setosa        FALSE
[1] ""
   Sepal.Length Sepal.Width Petal.Length Petal.Width
51          7.0         3.2          4.7         1.4
55          6.5         2.8          4.6         1.5
56          5.7         2.8          4.5         1.3
57          6.3         3.3          4.7         1.6
58          4.9         2.4          3.3         1.0
59          6.6         2.9          4.6         1.3

In [8]:
# train on positive-class examples only; the SVM will later predict whether new points belong to the same class

svm.model<-svm(trainpredictors,y=NULL,
               type='one-classification',
               nu=0.10,
               scale=TRUE,
               kernel="radial")

# SVM parameters:
# nu     - the proportion of outliers we expect in the training data
# kernel - kernel type: linear, radial (RBF, a non-linear function), etc.
# gamma  - parameter of the RBF kernel; controls the influence of individual
#          training samples, which affects the smoothness of the model:
#          low gamma = low variance & high bias, high gamma = high variance & low bias

#svm(x, y = NULL, scale = TRUE, type = NULL, kernel =
#"radial", degree = 3, gamma = if (is.vector(x)) 1 else 1 / ncol(x),
#coef0 = 0, cost = 1, nu = 0.5,
#class.weights = NULL, cachesize = 40, tolerance = 0.001, epsilon = 0.1,
#shrinking = TRUE, cross = 0, probability = FALSE, fitted = TRUE,
#..., subset, na.action = na.omit)
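A quick way to see gamma's effect on smoothness (a hypothetical side experiment, not part of the model above): fit the same one-class SVM with a very low and a very high gamma and compare how many support vectors each needs. A high-gamma boundary hugs the training points closely, so it typically retains far more of them as support vectors than a smooth low-gamma boundary does.

```r
library(e1071)

data(iris)
versicolor <- subset(iris, Species == "versicolor")[, 1:4]

# same data and nu, two very different gammas (illustrative values)
fit_low  <- svm(versicolor, y = NULL, type = "one-classification",
                nu = 0.10, scale = TRUE, kernel = "radial", gamma = 0.01)
fit_high <- svm(versicolor, y = NULL, type = "one-classification",
                nu = 0.10, scale = TRUE, kernel = "radial", gamma = 10)

# a wiggly high-gamma boundary tends to need many more support vectors
c(low_gamma = fit_low$tot.nSV, high_gamma = fit_high$tot.nSV)
```

With nu fixed, the support-vector count is a rough proxy for boundary complexity, which is why gamma is the main knob for the bias/variance trade-off noted in the comments above.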

In [9]:
svm.predtrain<-predict(svm.model,trainpredictors)
svm.predtest<-predict(svm.model,testpredictors)

confTrain<-table(Predicted=svm.predtrain,Reference=trainLabels)
confTest<-table(Predicted=svm.predtest,Reference=testLabels)

confusionMatrix(confTest,positive='TRUE')


print(confTrain)
print(confTest)


Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE   100   12
    TRUE      0    6
                                          
               Accuracy : 0.8983          
                 95% CI : (0.8291, 0.9463)
    No Information Rate : 0.8475          
    P-Value [Acc > NIR] : 0.074334        
                                          
                  Kappa : 0.4587          
 Mcnemar's Test P-Value : 0.001496        
                                          
            Sensitivity : 0.33333         
            Specificity : 1.00000         
         Pos Pred Value : 1.00000         
         Neg Pred Value : 0.89286         
             Prevalence : 0.15254         
         Detection Rate : 0.05085         
   Detection Prevalence : 0.05085         
      Balanced Accuracy : 0.66667         
                                          
       'Positive' Class : TRUE            
                                          
         Reference
Predicted TRUE
    FALSE    5
    TRUE    27
         Reference
Predicted FALSE TRUE
    FALSE   100   12
    TRUE      0    6
